Data Analysis

Week 7

Jafet Belmont & Jenn Gaskell

School of Mathematics and Statistics

ILOs

By the end of this session you will be able to:

  • Fit a logistic regression model with either a numerical or categorical explanatory variable.

  • Interpret the regression coefficients of the logistic model in terms of their effects on both the odds and the log-odds of the response.

    • Embed and compile R code.

    • Include external and self-generated plots/figures.

    • Create tables and write LaTeX mathematical notation.

  • Author a presentation using Quarto with some interactive features.

Assessment - Overview

| Release | Topics    | Assessment      | Weight | Due (Week)                                                |
|---------|-----------|-----------------|--------|-----------------------------------------------------------|
| 3       | Weeks 1-3 | Quiz 1          | 15     | 4                                                         |
| 3       | Research  | Peer assessment | 20     | 9 (End of Submission Stage); 10 (End of Assessment Stage) |
| 6       | Weeks 4-6 | Quiz 2          | 15     | 8                                                         |
| 9       | Weeks 1-8 | Class test      | 25     | 9                                                         |
| 4       | Weeks 1-8 | Group project   | 25     | 11                                                        |

Upcoming assessments

Moodle Quiz 2

  • Online quiz assessing the material from weeks 4 through 8

    • Release: after today’s lab

    • Due: Week 8 (Mar 07) 11:00 am BST

      • No late submissions are allowed.

Peer assessment

  • Research a randomly assigned topic and record an (up to) 15-minute presentation

    • Due on week(s):

      9 (End of Submission Stage)
      10 (End of Assessment Stage).

      No late submissions are allowed.

Upcoming assessments

Class test

  • Conduct a data analysis and write a short report in Quarto using the techniques covered in the course

    • Due on week(s): 9

    • Timed: 1.5 hours

    • Instructions for the class test will become available a week prior to the test. Please read them before the class test.

    • The specific tasks for the class test and the corresponding files will become available on the date of the test.

Upcoming assessments

Group project

  • Using collaborative coding, carry out data analysis on a given dataset and then write a report.

    • Group Allocations will be available online after the class.

    • Setup GitHub Repository by Week 8 (Mar 07) 11:00 am BST

    • Report submission due on week(s): 11

    • Contribution evaluations and Declaration of Originality

Overview of today’s session

Logistic regression

This week we will learn how to model outcomes of interest that take one of two categorical values (e.g. yes/no, success/failure, alive/dead), i.e.

  • binary, taking the value 1 (say success, with probability \(p_i\)) or 0 (failure, with probability \(1-p_i\)).

In this case,

\[ y_i \sim \mathrm{Bin}(1,p_i), \qquad g(p_i) = \log \left(\frac{p_i}{1 - p_i} \right) \]

which is also referred to as the log-odds (since \(p_i ~ / ~ (1-p_i)\) is the odds of success).

\[p_i = \frac{\exp\left(\mathbf{x}_i^\top \boldsymbol{\beta}\right)}{1 + \exp\left(\mathbf{x}_i^\top \boldsymbol{\beta}\right)} ~~~ \in [0, 1].\]
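The mapping between the log-odds and the probability scale can be sketched directly in R; a minimal illustration (base R's `qlogis()` and `plogis()` are the built-in logit and inverse-logit functions):

```r
# logit: probability -> log-odds (equivalent to qlogis())
logit <- function(p) log(p / (1 - p))

# inverse logit: log-odds -> probability in [0, 1] (equivalent to plogis())
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

logit(0.5)               # 0: a probability of 0.5 has log-odds of 0
inv_logit(0)             # 0.5
inv_logit(logit(0.8))    # recovers 0.8
```

Note that `inv_logit()` always returns a value in \([0, 1]\), which is why the logistic link guarantees valid probabilities.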

Required R packages

Before we proceed, load all the packages needed for this week:

library(tidyr)
library(ggplot2)
library(moderndive)
library(sjPlot)
library(tidymodels)
library(broom)
library(performance)
library(janitor)

First example - Teaching evaluation scores

We will use the evals data set, available from the moderndive R package.

Code
evals.gender <- evals %>%
                  select(gender, age)

Fitting a logistic regression in R

\[ \log\left( \frac{p}{1-p} \right) = \alpha + \beta \cdot \textrm{age} \]

where \(p = \textrm{Prob}\left(\textrm{Male}\right)\) and \(1 - p = \textrm{Prob}\left(\textrm{Female}\right)\).

model <- glm(gender ~ age, data = evals.gender, family = binomial)
model %>% broom::tidy(conf.int = TRUE, conf.level = 0.95)
| term        | estimate   | std.error | statistic | p.value |
|-------------|------------|-----------|-----------|---------|
| (Intercept) | -2.6979460 | 0.5119379 | -5.270066 | 1e-07   |
| age         |  0.0629647 | 0.0105852 |  5.948386 | 0e+00   |
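To interpret the age coefficient on the odds scale we exponentiate the log-odds estimates; a sketch using the fitted model above (the numerical interpretation is based on the estimates shown in the output):

```r
# Odds-scale coefficients: exponentiate the log-odds estimates
exp(coef(model))
# exp(0.0630) is about 1.065: for each additional year of age, the odds of
# the instructor being male are multiplied by roughly 1.065 (~6.5% increase)

# Confidence intervals on the odds scale
exp(confint(model))
```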

Relationship between Odds and Probabilities

Table 1: Relationship between Odds and Probabilities

| Scale       | Equivalence                                                                                                                 |
|-------------|-----------------------------------------------------------------------------------------------------------------------------|
| Odds        | \( \mathrm{Odds} = \exp(\mathrm{logOdds}) = \dfrac{P(\mathrm{event})}{1-P(\mathrm{event})} \)                               |
| Probability | \( P(\mathrm{event}) = \dfrac{\exp(\mathrm{logOdds})}{1+\exp(\mathrm{logOdds})} = \dfrac{\mathrm{Odds}}{1+\mathrm{Odds}} \) |
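These conversions can be written directly in R; a minimal sketch of a round trip between the three scales:

```r
log_odds <- 0.5
odds <- exp(log_odds)        # log-odds -> odds
p    <- odds / (1 + odds)    # odds -> probability

# Converting back recovers the original log-odds
all.equal(log(p / (1 - p)), log_odds)   # TRUE
```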

Model evaluation

library(performance)
check_model(model, panel = TRUE)

Predictive performance metrics

How well does our model predict new observations?

  • Compute the predicted classes and compare them against the observed values.

  • We typically classify the predicted probabilities into discrete classes based on a threshold (commonly 0.5 for binary classification).

Code
pred_results <- model %>% 
  augment(type.predict = "response") %>%
  mutate(predicted_class = 
           factor(ifelse(.fitted > 0.5, "male", "female")))
| gender | age | .fitted   | .resid     | .hat      | .sigma   | .cooksd   | .std.resid | predicted_class |
|--------|-----|-----------|------------|-----------|----------|-----------|------------|-----------------|
| female | 36  | 0.3938360 | -1.0006045 | 0.0058192 | 1.132909 | 0.0019126 | -1.003529  | female          |
| female | 36  | 0.3938360 | -1.0006045 | 0.0058192 | 1.132909 | 0.0019126 | -1.003529  | female          |
| female | 36  | 0.3938360 | -1.0006045 | 0.0058192 | 1.132909 | 0.0019126 | -1.003529  | female          |
| female | 36  | 0.3938360 | -1.0006045 | 0.0058192 | 1.132909 | 0.0019126 | -1.003529  | female          |
| male   | 59  | 0.7343825 |  0.7857803 | 0.0047899 | 1.133280 | 0.0008746 |  0.787669  | male            |
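A confusion matrix of observed versus predicted classes can be tabulated with janitor (already loaded above); a sketch assuming `pred_results` from the previous chunk:

```r
# Cross-tabulate observed gender against the predicted class,
# with row and column totals added for readability
pred_results %>%
  janitor::tabyl(gender, predicted_class) %>%
  janitor::adorn_totals(c("row", "col"))
```

The diagonal cells of this table count the correctly classified instructors; the off-diagonal cells count the misclassifications.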

Predictive performance metrics

We can use these predicted classes to compute different predictive performance/evaluation metrics.

The table can be interpreted as follows:

  • The correct classification rate (CCR) or accuracy describes the overall proportion of teaching instructors (males or females) that were classified correctly.

  • The true positive rate (TPR) or sensitivity (a.k.a. recall), denotes the proportion of actual male instructors that are correctly classified as males by the model.

  • The true negative rate (TNR) or specificity, denotes the proportion of actual females that have been classified correctly as females by the model.

  • The model’s precision or positive predictive value (PPV) represents the proportion of predicted male instructors that were actually male.

  • The model’s negative predictive value (NPV) represents the proportion of predicted female instructors that were actually female.
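These metrics can be computed with yardstick (part of tidymodels, loaded earlier); a sketch assuming `pred_results` from the earlier chunk, where "male" is the second factor level and is treated as the positive class:

```r
library(yardstick)

# Overall accuracy (CCR)
pred_results %>%
  accuracy(truth = gender, estimate = predicted_class)

# Sensitivity / TPR: proportion of actual males classified as male
pred_results %>%
  sens(truth = gender, estimate = predicted_class, event_level = "second")

# Specificity / TNR: proportion of actual females classified as female
pred_results %>%
  spec(truth = gender, estimate = predicted_class, event_level = "second")
```

The `event_level = "second"` argument tells yardstick which factor level counts as the "event" (positive class); by default it uses the first level.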

ROC Curve

  • Plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various threshold levels.

  • The closer the ROC curve is to the top-left corner, the better the model is at distinguishing between the positive and negative classes.
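An ROC curve can be produced from the fitted probabilities with yardstick; a sketch assuming `pred_results` from the earlier chunk, where `.fitted` holds the fitted probability of being male:

```r
# ROC curve from the fitted probabilities; "male" is the positive class
roc_data <- pred_results %>%
  yardstick::roc_curve(truth = gender, .fitted, event_level = "second")
autoplot(roc_data)

# Area under the ROC curve (AUC): 0.5 is no better than chance, 1 is perfect
pred_results %>%
  yardstick::roc_auc(truth = gender, .fitted, event_level = "second")
```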

Logistic regression with one categorical explanatory variable

Instead of having a numerical explanatory variable such as age, let’s now use the binary categorical variable ethnicity as our explanatory variable.

Fitting the model

evals.ethnic <- evals %>%
                  select(gender, ethnicity)
model.ethnic <- glm(gender ~ ethnicity,
                    data = evals.ethnic,
                    family = binomial) 
model.ethnic %>% broom::tidy(conf.int = TRUE, conf.level = 0.95)

| term                  | estimate | std.error | statistic | p.value | conf.low | conf.high |
|-----------------------|----------|-----------|-----------|---------|----------|-----------|
| (Intercept)           | -0.25    | 0.25      | -1.00     | 0.32    | -0.75    | 0.24      |
| ethnicitynot minority |  0.66    | 0.27      |  2.44     | 0.01    |  0.13    | 1.20      |

Interpretation of model parameters

Let’s break this down. The model we have fitted is:

\[ \mathrm{log}\left(\dfrac{p_i}{1-p_i}\right) = \alpha + \beta_{\mbox{ethnicity}} \times \mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority}) \]

  • \(\alpha\) is the intercept, representing the log-odds when \(\mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority}) = 0\) (i.e., when the instructor is in the minority group).

    • When the instructor belongs to the reference category minority the model simplifies to: \[\mathrm{log}\left(\frac{p_i}{1-p_i}\right) = \alpha \]
  • \(\beta_{\mathrm{ethnicity}}\) is the coefficient for the predictor \(\mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority})\), which shows how the log-odds change when moving from the reference category (minority) to the other level (not minority).

    • When the instructor does not belong to reference category, i.e. \(\mathbb{I}_{\mathrm{ethnicity}}(\mathrm{not~ minority}) = 1\), the model becomes: \[\mathrm{log}\left(\dfrac{p_i}{1-p_i}\right) = \alpha + \beta_{\mbox{ethnicity}}\]

So, the log-odds of the instructors being male in the not minority group are \(\alpha +\beta_{\mbox{ethnicity}}\).
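Exponentiating moves these quantities from the log-odds scale to the odds scale; a sketch using the estimates from the table above (values are approximate, computed from the rounded estimates):

```r
# Odds of being male in the reference (minority) group: exp(alpha)
exp(-0.25)           # about 0.78

# Odds ratio comparing "not minority" to "minority": exp(beta)
exp(0.66)            # about 1.93: the odds of being male are roughly
                     # 1.9 times higher in the "not minority" group

# Odds of being male in the "not minority" group: exp(alpha + beta)
exp(-0.25 + 0.66)    # about 1.51
```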

The steps ahead

  • Calculate, using R, the logistic regression coefficients on the odds and probability scales when either a continuous or a categorical explanatory variable is used in the model.

  • Correctly interpret the model parameters of a fitted logistic regression (in terms of the log-odds, odds and probabilities) with either a continuous or a categorical explanatory variable.

  • Check how to visualise the results of a fitted logistic regression with either a continuous or a categorical explanatory variable.

  • Interpret the model diagnostic plots and predictive performance metrics of a logistic regression model.

  • Interpret the model diagnostic plots and predictive performance metrics of a logistic regression model.